Title: Socioeconomic and Demographic Impacts on Average Cancer Diagnosis and Deaths per Year in the US
Author: Alexa Neal
In this project, I wanted to explore how various non-biological factors are related to the average number of cancer cases diagnosed and the average number of cancer deaths in the US. One study found that individuals on Medicare with low incomes had greater challenges in affording their healthcare (Park & Fung, 2025), highlighting that factors other than genetics and biology can impact health outcomes. This also showcases how certain populations are vulnerable to worse outcomes due to lack of access to healthcare. Discerning these variables can be useful in understanding barriers to important medical resources and how to improve access.
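To explore these concepts, I found two data sets on Kaggle, one containing health-related information and the other containing demographic information, and combined them. I used exploratory data analysis and multiple linear regression models to understand what factors correlate with the average number of cancer cases diagnosed and the average number of cancer deaths per year. My research questions were:
1. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer diagnoses per year across U.S. counties?
2. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer deaths per year across U.S. counties?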
avganncount: Average number of cancer cases diagnosed annually
avgdeathsperyear: Average number of deaths due to cancer per year
medincome: Median income in the region
povertypercent: Percentage of population below the poverty line
pctprivatecoveragealone: Percentage of population covered by private health insurance alone
pctempprivcoverage: Percentage of population covered by employee-provided private health insurance
pctpubliccoveragealone: Percentage of population covered by public health insurance only
pctwhite: Percentage of White population
pctblack: Percentage of Black population
pctasian: Percentage of Asian population
target_deathrate: Target death rate due to cancer
incidencerate: Incidence rate of cancer
popest2015: Estimated population in 2015
studypercap: Per capita number of cancer-related clinical trials conducted
binnedinc: Binned median income
medianage: Median age in the region
pctpubliccoverage: Percentage of population covered by public health insurance
pctotherrace: Percentage of population belonging to other races
pctmarriedhouseholds: Percentage of married households
birthrate: Birth rate in the region
statefips: The FIPS code representing the state
countyfips: The FIPS code representing the county or census area within the state
avghouseholdsize: The average household size in the region
geography: The geographical location, typically represented as the county or census area name followed by the state name
Before performing any analysis, I used plot_intro() to get an overview of the data set and found that it contained missing values. I then used plot_missing() to see which columns these values were in. To clean the data, I removed rows with missing values in pctprivatecoveragealone and pctemployed16_over, as they were variables of interest to me. I also removed the pctsomecol18_24 column entirely due to its large number of missing values.
Correlation matrix for log_avganncount and the variables of interest:

| | log_avganncount | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoveragealone | medincome | povertypercent | pctwhite | pctblack | pctasian |
|---|---|---|---|---|---|---|---|---|---|
| log_avganncount | 1.00000000 | 0.3289180 | 0.3793725 | -0.1548701 | 0.3475692 | -0.2178237 | -0.08490097 | 0.03612490 | 0.38219128 |
| pctprivatecoveragealone | 0.32891796 | 1.0000000 | 0.9297679 | -0.8562727 | 0.7891926 | -0.7604732 | 0.30700697 | -0.27401532 | 0.28436398 |
| pctempprivcoverage | 0.37937251 | 0.9297679 | 1.0000000 | -0.7344268 | 0.7540889 | -0.6846717 | 0.26891221 | -0.24188144 | 0.28820069 |
| pctpubliccoveragealone | -0.15487012 | -0.8562727 | -0.7344268 | 1.0000000 | -0.7195812 | 0.7981458 | -0.36656842 | 0.33290227 | -0.18123465 |
| medincome | 0.34756921 | 0.7891926 | 0.7540889 | -0.7195812 | 1.0000000 | -0.7866990 | 0.16330351 | -0.26438728 | 0.41418170 |
| povertypercent | -0.21782367 | -0.7604732 | -0.6846717 | 0.7981458 | -0.7866990 | 1.0000000 | -0.51104550 | 0.51769900 | -0.14739052 |
| pctwhite | -0.08490097 | 0.3070070 | 0.2689122 | -0.3665684 | 0.1633035 | -0.5110455 | 1.00000000 | -0.83069477 | -0.27463578 |
| pctblack | 0.03612490 | -0.2740153 | -0.2418814 | 0.3329023 | -0.2643873 | 0.5176990 | -0.83069477 | 1.00000000 | 0.02231497 |
| pctasian | 0.38219128 | 0.2843640 | 0.2882007 | -0.1812346 | 0.4141817 | -0.1473905 | -0.27463578 | 0.02231497 | 1.00000000 |
Correlation matrix for log_avgdeathsperyear and the variables of interest:

| | log_avgdeathsperyear | pctprivatecoveragealone | pctempprivcoverage | pctpubliccoveragealone | medincome | povertypercent | pctwhite | pctblack | pctasian |
|---|---|---|---|---|---|---|---|---|---|
| log_avgdeathsperyear | 1.00000000 | 0.2146236 | 0.3132878 | -0.01207791 | 0.2752465 | -0.08250517 | -0.1830590 | 0.13999944 | 0.42229093 |
| pctprivatecoveragealone | 0.21462355 | 1.0000000 | 0.9297679 | -0.85627271 | 0.7891926 | -0.76047315 | 0.3070070 | -0.27401532 | 0.28436398 |
| pctempprivcoverage | 0.31328784 | 0.9297679 | 1.0000000 | -0.73442681 | 0.7540889 | -0.68467172 | 0.2689122 | -0.24188144 | 0.28820069 |
| pctpubliccoveragealone | -0.01207791 | -0.8562727 | -0.7344268 | 1.00000000 | -0.7195812 | 0.79814584 | -0.3665684 | 0.33290227 | -0.18123465 |
| medincome | 0.27524653 | 0.7891926 | 0.7540889 | -0.71958117 | 1.0000000 | -0.78669897 | 0.1633035 | -0.26438728 | 0.41418170 |
| povertypercent | -0.08250517 | -0.7604732 | -0.6846717 | 0.79814584 | -0.7866990 | 1.00000000 | -0.5110455 | 0.51769900 | -0.14739052 |
| pctwhite | -0.18305899 | 0.3070070 | 0.2689122 | -0.36656842 | 0.1633035 | -0.51104550 | 1.0000000 | -0.83069477 | -0.27463578 |
| pctblack | 0.13999944 | -0.2740153 | -0.2418814 | 0.33290227 | -0.2643873 | 0.51769900 | -0.8306948 | 1.00000000 | 0.02231497 |
| pctasian | 0.42229093 | 0.2843640 | 0.2882007 | -0.18123465 | 0.4141817 | -0.14739052 | -0.2746358 | 0.02231497 | 1.00000000 |
The distributions of average annual cancer diagnoses and average annual cancer deaths are both right-skewed, suggesting that a log transformation will be useful. For the rest of this analysis, I used the log of average annual cancer diagnoses and the log of average annual cancer deaths.
Insurance coverage is weakly correlated with both the log of cancer diagnoses and the log of cancer deaths. Private-only and employee-provided coverage are positively correlated with both responses (r between 0.21 and 0.38), while public-only coverage is negatively correlated with diagnoses (r = -0.15) and essentially uncorrelated with deaths (r = -0.01).
Median income and poverty percentage are also weakly correlated with both responses: median income positively (r = 0.35 for diagnoses and 0.28 for deaths) and poverty percentage negatively (r = -0.22 and -0.08).
Race is weakly correlated with both responses as well. The percentage of the White population has a negative correlation (r = -0.08 for diagnoses and -0.18 for deaths), while the percentages of the Black (r = 0.04 and 0.14) and Asian (r = 0.38 and 0.42) populations have positive correlations.
The correlation coefficients confirm these observations.
As mentioned previously, data cleaning removed observations with missing values in pctprivatecoveragealone and pctemployed16_over and dropped pctsomecol18_24 entirely. Furthermore, incidencerate and target_deathrate were removed since they represent aspects of cancer frequency or mortality already captured by my response variables (avganncount and avgdeathsperyear). Similarly, log(avgdeathsperyear) was excluded from the model for log(avganncount), and vice versa. Before fitting the models, binnedinc, geography, statefips, and countyfips were removed since they are categorical labels or identifier codes rather than quantitative measures. Finally, since the histograms of avganncount and avgdeathsperyear were right-skewed, I used the log of both response variables to make the data better suited for linear regression.
Since the log-transformed responses and the predictors are all continuous, multiple linear regression was used, and the predictors were selected using backward selection based on AIC (stepAIC).
Call:
lm(formula = log_avganncount ~ popest2015 + povertypercent +
studypercap + medianagemale + medianagefemale + percentmarried +
pctnohs18_24 + pcths18_24 + pcths25_over + pctbachdeg25_over +
pctunemployed16_over + pctprivatecoverage + pctprivatecoveragealone +
pctempprivcoverage + pctpubliccoveragealone + pctwhite +
pctblack + pctasian + pctmarriedhouseholds + birthrate +
avghouseholdsize, data = cancer_diagnoses)
Residuals:
Min 1Q Median 3Q Max
-8.2711 -0.6496 -0.0186 0.5772 3.7385
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.402e+00 1.074e+00 2.236 0.025442 *
popest2015 1.254e-06 7.481e-08 16.763 < 2e-16 ***
povertypercent -6.722e-02 8.363e-03 -8.038 1.44e-15 ***
studypercap 7.249e-05 3.870e-05 1.873 0.061207 .
medianagemale -8.905e-02 1.279e-02 -6.965 4.27e-12 ***
medianagefemale 3.598e-02 1.366e-02 2.634 0.008502 **
percentmarried 3.386e-02 1.035e-02 3.273 0.001081 **
pctnohs18_24 -8.555e-03 3.402e-03 -2.514 0.011993 *
pcths18_24 -4.714e-03 2.934e-03 -1.606 0.108346
pcths25_over -2.815e-02 5.720e-03 -4.921 9.24e-07 ***
pctbachdeg25_over 1.973e-02 9.012e-03 2.190 0.028634 *
pctunemployed16_over 7.580e-02 9.786e-03 7.746 1.41e-14 ***
pctprivatecoverage 8.604e-02 1.038e-02 8.285 < 2e-16 ***
pctprivatecoveragealone -9.620e-02 1.360e-02 -7.076 1.96e-12 ***
pctempprivcoverage 7.073e-02 7.468e-03 9.471 < 2e-16 ***
pctpubliccoveragealone 8.708e-02 9.081e-03 9.590 < 2e-16 ***
pctwhite 1.401e-02 3.510e-03 3.991 6.79e-05 ***
pctblack 1.142e-02 3.360e-03 3.397 0.000692 ***
pctasian 4.316e-02 1.160e-02 3.721 0.000204 ***
pctmarriedhouseholds -6.061e-02 1.080e-02 -5.614 2.21e-08 ***
birthrate -3.227e-02 1.144e-02 -2.822 0.004816 **
avghouseholdsize 4.366e-01 1.962e-01 2.225 0.026190 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.054 on 2310 degrees of freedom
Multiple R-squared: 0.4665, Adjusted R-squared: 0.4616
F-statistic: 96.18 on 21 and 2310 DF, p-value: < 2.2e-16
Call:
lm(formula = log_avgdeathsperyear ~ popest2015 + povertypercent +
medianagemale + pctnohs18_24 + pcths25_over + pctbachdeg25_over +
pctemployed16_over + pctunemployed16_over + pctprivatecoverage +
pctprivatecoveragealone + pctempprivcoverage + pctpubliccoveragealone +
pctwhite + pctblack + pctasian + pctmarriedhouseholds + birthrate +
avghouseholdsize, data = cancer_deaths)
Residuals:
Min 1Q Median 3Q Max
-9.0593 -0.4928 0.0432 0.5655 2.3892
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.008e+00 8.557e-01 4.684 2.98e-06 ***
popest2015 1.325e-06 6.027e-08 21.991 < 2e-16 ***
povertypercent -8.044e-02 7.079e-03 -11.364 < 2e-16 ***
medianagemale -6.800e-02 6.190e-03 -10.986 < 2e-16 ***
pctnohs18_24 -1.564e-02 2.667e-03 -5.864 5.17e-09 ***
pcths25_over -6.803e-03 4.475e-03 -1.520 0.128524
pctbachdeg25_over 5.250e-02 7.296e-03 7.196 8.32e-13 ***
pctemployed16_over -1.749e-02 4.610e-03 -3.795 0.000152 ***
pctunemployed16_over 7.884e-02 8.304e-03 9.494 < 2e-16 ***
pctprivatecoverage 5.329e-02 8.348e-03 6.384 2.08e-10 ***
pctprivatecoveragealone -1.066e-01 1.118e-02 -9.528 < 2e-16 ***
pctempprivcoverage 8.397e-02 5.969e-03 14.069 < 2e-16 ***
pctpubliccoveragealone 7.903e-02 7.294e-03 10.835 < 2e-16 ***
pctwhite 2.205e-02 2.821e-03 7.818 8.07e-15 ***
pctblack 2.127e-02 2.677e-03 7.946 2.98e-15 ***
pctasian 6.347e-02 9.391e-03 6.758 1.76e-11 ***
pctmarriedhouseholds -2.684e-02 4.939e-03 -5.435 6.06e-08 ***
birthrate -5.703e-02 9.185e-03 -6.209 6.30e-10 ***
avghouseholdsize 2.367e-01 1.311e-01 1.806 0.071057 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8516 on 2313 degrees of freedom
Multiple R-squared: 0.5845, Adjusted R-squared: 0.5813
F-statistic: 180.8 on 18 and 2313 DF, p-value: < 2.2e-16
Other socioeconomic and demographic factors that were not present in this data set could also be explored. For example, the number of people who went through with treatment could correlate with the average number of cancer deaths per year. Furthermore, it would be important to know whether the number of people who went through treatment was related to the cost or availability of the treatment.
Additionally, although most of the retained predictors were statistically significant, the R-squared values (0.47 for diagnoses and 0.58 for deaths) show that a large share of the variation remains unexplained, so variables not in the data could also be associated with cancer diagnoses and deaths. Genetics and other biological factors could be significantly impacting health outcomes relating to cancer.
---
title: "Cancer Analysis"
output:
  flexdashboard::flex_dashboard:
    theme:
      version: 4
      bootswatch: lumen
    orientation: columns
    vertical_layout: fill
    source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(DT)
library(tidyverse)
library(pacman)
library(dplyr)
library(DataExplorer)
library(car)
library(leaps)
library(MASS)
```
Introduction
===
Column {data-width=450}
---
### Background
**Title:** Socioeconomic and Demographic Impacts on Average Cancer Diagnosis and Deaths per Year in the US
**Author:** Alexa Neal
In this project, I wanted to explore how various non-biological factors are related to the average number of cancer cases diagnosed and the average number of cancer deaths in the US. [One study](https://pmc.ncbi.nlm.nih.gov/articles/PMC12455370/) found that individuals on Medicare with low incomes had greater challenges in affording their healthcare (Park & Fung, 2025), highlighting that factors other than genetics and biology can impact health outcomes. This also showcases how certain populations are vulnerable to worse outcomes due to lack of access to healthcare. Discerning these variables can be useful in understanding barriers to important medical resources and how to improve access.
### Research Questions
To explore these concepts, I found two data sets on [Kaggle](https://www.kaggle.com/datasets/varunraskar/cancer-regression), one containing health-related information and the other containing demographic information, and combined them. I used exploratory data analysis and multiple linear regression models to understand what factors correlate with the average number of cancer cases diagnosed and the average number of cancer deaths per year. My research questions were:
1. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer diagnoses per year across U.S. counties?
2. Which socioeconomic and demographic factors are most strongly associated with the average number of cancer deaths per year across U.S. counties?
Column {.tabset data-width=550}
---
### Variables of Interest
- avganncount: Average number of cancer cases diagnosed annually
- avgdeathsperyear: Average number of deaths due to cancer per year
- medincome: Median income in the region
- povertypercent: Percentage of population below the poverty line
- pctprivatecoveragealone: Percentage of population covered by private health insurance alone
- pctempprivcoverage: Percentage of population covered by employee-provided private health insurance
- pctpubliccoveragealone: Percentage of population covered by public health insurance only
- pctwhite: Percentage of White population
- pctblack: Percentage of Black population
- pctasian: Percentage of Asian population
### Other Variables
- target_deathrate: Target death rate due to cancer
- incidencerate: Incidence rate of cancer
- popest2015: Estimated population in 2015
- studypercap: Per capita number of cancer-related clinical trials conducted
- binnedinc: Binned median income
- medianage: Median age in the region
- pctpubliccoverage: Percentage of population covered by public health insurance
- pctotherrace: Percentage of population belonging to other races
- pctmarriedhouseholds: Percentage of married households
- birthrate: Birth rate in the region
- statefips: The FIPS code representing the state
- countyfips: The FIPS code representing the county or census area within the state
- avghouseholdsize: The average household size in the region
- geography: The geographical location, typically represented as the county or census area name followed by the state name
### Data Cleaning
Before performing any analysis, I used plot_intro() to get an overview of the data set and found that it contained missing values. I then used plot_missing() to see which columns these values were in. To clean the data, I removed rows with missing values in pctprivatecoveragealone and pctemployed16_over, as they were variables of interest to me. I also removed the pctsomecol18_24 column entirely due to its large number of missing values.
#### Introduction to Data
```{r cleaning}
## reading + joining the data
household <- read.csv("C:/Users/write/OneDrive/Desktop/school/regression files/avg-household-size.csv")
cancer_reg <- read.csv("C:/Users/write/OneDrive/Desktop/school/regression files/cancer_reg (1).csv")
cancer <- left_join(cancer_reg, household)
plot_intro(cancer)
```
#### Missing Value Distribution
```{r missing value distribution}
plot_missing(cancer)
cancer <- cancer %>%
  drop_na(pctprivatecoveragealone, pctemployed16_over) %>%
  dplyr::select(-c(pctsomecol18_24))
```
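As a complement to `plot_missing()`, the same cleaning steps can be sketched numerically in base R. This is only an illustration: the `toy` data frame and its values are made up, and only the column names are borrowed from this data set.

```r
# Toy data frame (made-up values) standing in for the joined cancer data
toy <- data.frame(
  pctprivatecoveragealone = c(48.7, NA, 43.5, 40.3),
  pctemployed16_over      = c(51.9, 55.9, NA, 48.3),
  pctsomecol18_24         = c(NA, NA, NA, 42.1)
)

# Percentage of missing values in each column
missing_pct <- 100 * colMeans(is.na(toy))
print(missing_pct)

# Drop the mostly-missing column, then keep only rows that are complete
# in the remaining columns (a base R analogue of drop_na())
toy_clean <- toy[, names(toy) != "pctsomecol18_24"]
toy_clean <- toy_clean[complete.cases(toy_clean), ]
print(nrow(toy_clean))
```

Looking at the percentages first makes the choice explicit: a column that is mostly missing is dropped, while columns with few gaps lose only the affected rows.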
EDA
===
Column {.tabset data-width=600}
---
### Diagnosis
```{r diagnosis histogram}
ggplot(cancer, aes(x = avganncount)) +
  geom_histogram(fill = "lightblue") +
  labs(title = "Distribution of Cancer Diagnoses",
       x = "Average Diagnoses Per Year", y = "Frequency")
```
### Deaths
```{r death histogram}
ggplot(cancer, aes(x = avgdeathsperyear)) +
  geom_histogram(fill = "lightblue") +
  labs(title = "Distribution of Cancer Deaths",
       x = "Average Deaths Per Year", y = "Frequency")
```
### Insurance
#### Average Diagnoses
```{r insurance diagnosis}
cancer$log_avganncount <- log(cancer$avganncount)
pairs(~log_avganncount + pctprivatecoveragealone +
        pctempprivcoverage + pctpubliccoveragealone, data = cancer)
```
#### Average Deaths
```{r insurance death}
cancer$log_avgdeathsperyear <- log(cancer$avgdeathsperyear)
pairs(~log_avgdeathsperyear + pctprivatecoveragealone +
        pctempprivcoverage + pctpubliccoveragealone, data = cancer)
```
### Income
#### Average Diagnoses
```{r income diagnoses}
pairs(~log_avganncount + medincome + povertypercent, data = cancer)
```
#### Average Deaths
```{r income death}
pairs(~log_avgdeathsperyear + medincome + povertypercent, data = cancer)
```
### Race
#### Average Diagnoses
```{r race diagnoses}
pairs(~log_avganncount + pctwhite + pctblack + pctasian,
      data = cancer)
```
#### Average Deaths
```{r race death}
pairs(~log_avgdeathsperyear + pctwhite + pctblack + pctasian,
      data = cancer)
```
### Correlation
#### Average Diagnoses
```{r diagnoses correlation}
cor(cancer[,c("log_avganncount", "pctprivatecoveragealone", "pctempprivcoverage", "pctpubliccoveragealone", "medincome", "povertypercent", "pctwhite", "pctblack", "pctasian")])
```
#### Average Deaths
```{r death correlation}
cor(cancer[,c("log_avgdeathsperyear", "pctprivatecoveragealone", "pctempprivcoverage", "pctpubliccoveragealone", "medincome", "povertypercent", "pctwhite", "pctblack", "pctasian")])
```
Column {data-width=400}
---
### Analysis
The distributions of average annual cancer diagnoses and average annual cancer deaths are both right-skewed, suggesting that a log transformation will be useful. For the rest of this analysis, I used the log of average annual cancer diagnoses and the log of average annual cancer deaths.
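To illustrate why a log transformation helps with right-skewed counts, here is a minimal sketch on simulated data. The log-normal sample and the `skewness()` helper are assumptions for illustration only; they are not the cancer data or a function from the packages loaded above.

```r
set.seed(1)

# Simulated right-skewed values (log-normal); NOT the cancer data
x <- rlnorm(1000, meanlog = 5, sdlog = 1.5)

# Sample skewness: the third standardized moment (hypothetical helper)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)       # strongly positive for right-skewed data
skew_log <- skewness(log(x))  # close to zero after the transformation

cat("skewness (raw):", round(skew_raw, 2), "\n")
cat("skewness (log):", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation suggests a more symmetric response, which tends to suit the residual assumptions of linear regression better.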
Insurance coverage is weakly correlated with both the log of cancer diagnoses and the log of cancer deaths. Private-only and employee-provided coverage are positively correlated with both responses (r between 0.21 and 0.38), while public-only coverage is negatively correlated with diagnoses (r = -0.15) and essentially uncorrelated with deaths (r = -0.01).
Median income and poverty percentage are also weakly correlated with both responses: median income positively (r = 0.35 for diagnoses and 0.28 for deaths) and poverty percentage negatively (r = -0.22 and -0.08).
Race is weakly correlated with both responses as well. The percentage of the White population has a negative correlation (r = -0.08 for diagnoses and -0.18 for deaths), while the percentages of the Black (r = 0.04 and 0.14) and Asian (r = 0.38 and 0.42) populations have positive correlations.
The correlation coefficients confirm these observations.
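When reading a full correlation matrix, it can help to pull out only the column for the response and sort it by strength. The sketch below uses the built-in `mtcars` data as a stand-in, with `mpg` playing the role of the response; the variable choice is an assumption for illustration.

```r
# Built-in mtcars data used as a stand-in; mpg plays the role of the response
vars <- c("mpg", "wt", "hp", "qsec")
cor_mat <- cor(mtcars[, vars])

# Correlation of each predictor with the response (drop the response's own row)
r_resp <- cor_mat[-1, "mpg"]

# Sort by absolute value so the strongest associations come first
r_resp <- r_resp[order(abs(r_resp), decreasing = TRUE)]
print(round(r_resp, 2))
```

The same pattern would apply to the matrices above by replacing `mtcars` with `cancer` and `"mpg"` with `"log_avganncount"` or `"log_avgdeathsperyear"`.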
Methods
===
Column {data-width=500}
---
### Data Cleaning
As mentioned previously, data cleaning removed observations with missing values in pctprivatecoveragealone and pctemployed16_over and dropped pctsomecol18_24 entirely. Furthermore, incidencerate and target_deathrate were removed since they represent aspects of cancer frequency or mortality already captured by my response variables (avganncount and avgdeathsperyear). Similarly, log(avgdeathsperyear) was excluded from the model for log(avganncount), and vice versa. Before fitting the models, binnedinc, geography, statefips, and countyfips were removed since they are categorical labels or identifier codes rather than quantitative measures. Finally, since the histograms of avganncount and avgdeathsperyear were right-skewed, I used the log of both response variables to make the data better suited for linear regression.
### Models Fit
Since the log-transformed responses and the predictors are all continuous, multiple linear regression was used, and the predictors were selected using backward selection based on AIC (stepAIC).
- multiple linear regression for log(avganncount)
- multiple linear regression for log(avgdeathsperyear)
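
As a small illustration of how backward selection with `stepAIC()` behaves, the sketch below fits a full model on the built-in `mtcars` data and lets the procedure drop terms. The data set and predictors are stand-ins chosen for illustration, not the cancer models themselves.

```r
library(MASS)  # provides stepAIC()

# Built-in mtcars data as a stand-in for the cancer data
full_fit <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Backward elimination: at each step, drop the term whose removal lowers
# AIC the most, and stop once no removal lowers AIC any further
back_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

print(formula(back_fit))
print(c(full = AIC(full_fit), backward = AIC(back_fit)))
```

Because the criterion is AIC rather than a p-value cutoff, a selected model can keep predictors that are not significant at the 0.05 level.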
Modeling
===
Column {.tabset data-width=600}
---
### Average Cancer Diagnoses
```{r}
cancer_diagnoses <- cancer %>%
  dplyr::select(-c(binnedinc, geography, statefips, countyfips,
                   avganncount, avgdeathsperyear, target_deathrate,
                   incidencerate, log_avgdeathsperyear))
full.cancer.diagnoses <- lm(log_avganncount ~ ., data = cancer_diagnoses)
fit.backward.diagnoses <- stepAIC(full.cancer.diagnoses, direction = "backward", trace = FALSE)
summary(fit.backward.diagnoses)
```
### Average Cancer Deaths
```{r}
cancer_deaths <- cancer %>%
  dplyr::select(-c(binnedinc, geography, statefips, countyfips,
                   avganncount, avgdeathsperyear, target_deathrate,
                   incidencerate, log_avganncount))
full.cancer.deaths <- lm(log_avgdeathsperyear ~ ., data = cancer_deaths)
fit.backward.deaths <- stepAIC(full.cancer.deaths, direction = "backward", trace = FALSE)
summary(fit.backward.deaths)
```
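
Because both responses are on the log scale, a coefficient b corresponds to a multiplicative change of exp(b) in the original response for a one-unit increase in that predictor, holding the other predictors fixed. The sketch below applies this to the povertypercent estimates reported in the two model summaries; `pct_change()` is a hypothetical helper, and the result describes an association, not a causal effect.

```r
# povertypercent estimates copied from the two model summaries
b_diag   <- -6.722e-02  # model for log(avganncount)
b_deaths <- -8.044e-02  # model for log(avgdeathsperyear)

# Percent change in the original-scale response per one-point increase
# in the predictor (hypothetical helper)
pct_change <- function(b) 100 * (exp(b) - 1)

print(round(pct_change(b_diag), 1))
print(round(pct_change(b_deaths), 1))
```

Under these fitted models, a one-point higher poverty percentage is associated with roughly 6.5% fewer average annual diagnoses and roughly 7.7% fewer average annual deaths, other predictors held constant.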
Column {data-width=400}
---
### Analysis
Conclusions
===
Column {data-width=500}
---
### Discussion
### Limitations
Column {data-width=500}
---
### Future Directions
Other socioeconomic and demographic factors that were not present in this data set could also be explored. For example, the number of people who went through with treatment could correlate with the average number of cancer deaths per year. Furthermore, it would be important to know whether the number of people who went through treatment was related to the cost or availability of the treatment.
Additionally, although most of the retained predictors were statistically significant, the R-squared values (0.47 for diagnoses and 0.58 for deaths) show that a large share of the variation remains unexplained, so variables not in the data could also be associated with cancer diagnoses and deaths. Genetics and other biological factors could be significantly impacting health outcomes relating to cancer.
Author
===
Column {data-width=500}
---
### About the Author
My name is Alexa Neal, and I am currently a senior at the University of Dayton. I am pursuing a Bachelor of Science in Premedicine with minors in Data Analytics, Medicine and Society, and Neuroscience. My projected graduation is May 2026, and I will be attending medical school in the fall of 2026.
### AI Acknowledgement
### Works Cited
Park, S., & Fung, V. (2025). Health Care Affordability Problems by Income Level and Subsidy Eligibility in Medicare. JAMA Network Open, 8(9), e2532862. https://doi.org/10.1001/jamanetworkopen.2025.32862
Column {data-width=500}
---
###
```{r headshot}
knitr::include_graphics("C:/Users/write/OneDrive/Desktop/photos/headshot.JPG")
```